I) A matter of point of view: Independent points approach

Quick data exploration

In this homework the objective is to characterize the running habits of a person. To do so we have the tracks of 60 running sessions. Each track is a succession of locations for which we have the exact timestamp. Little is known about the data collection process, so it is hard to evaluate how well the points represent the actual continuous running paths. The time between track points is not constant: the time between two consecutive points generally oscillates between 1 and 10 seconds. Both the speed and the distance between points are even more irregular.

Some points have a posterior mean speed (norm of the mean of the speed vectors between t-1 and t) far higher or lower than the others, as can be seen in the boxplots of the log of the mean speed. Similar observations can be made for the distance. We can only speculate about the origin of those variations (different modes of transportation, errors, ...), and it is not the objective of this homework to explain them. But it seemed important to us to highlight their existence, as they may introduce a bias in the density estimation if some locations received more points than others during a single passage.
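A minimal Python sketch of how the gaps, distances and speeds between consecutive points could be computed. It is not the code used for this report: it assumes the points are (time in seconds, latitude, longitude) triples, uses the haversine formula for the distance, and simplifies the speed to distance divided by elapsed time. The names `haversine_m`, `segment_stats` and `toy_track` are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def segment_stats(track):
    """track: list of (t_seconds, lat, lon) sorted by time.
    Returns one (dt, distance_m, speed_m_per_s) tuple per consecutive pair."""
    out = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(track, track[1:]):
        dt = t1 - t0
        d = haversine_m(la0, lo0, la1, lo1)
        speed = d / dt if dt > 0 else float("nan")  # guard against duplicated timestamps
        out.append((dt, d, speed))
    return out

# Hypothetical toy track with irregular time gaps (assumed data, not from the homework).
toy_track = [(0, 48.0, 2.0), (5, 48.0001, 2.0), (7, 48.0001, 2.0002), (17, 48.0003, 2.0002)]
stats = segment_stats(toy_track)

# Log-speeds, the quantity one would feed to a boxplot; zero or undefined speeds are skipped.
log_speeds = [math.log(s) for _, _, s in stats if s > 0]
```

Looking at `dt`, `distance_m` and the log-speeds side by side is one way to spot the points that are sampled unusually densely or sparsely.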

Favorite running places using mean shift

Bandwidth estimation

In this part we try to estimate the density corresponding to the runner's presence using a Gaussian kernel with a diagonal bandwidth. We chose this type of bandwidth to speed up the process, as the number of observations is quite large. To select the bandwidth we first tried Biased Cross-Validation with a starting matrix proportional to the variance. The function is the one implemented in the package ks. We had to apply the function to a sample of the points, once again for speed reasons.

##            [,1]         [,2]
## [1,] 3.7791e-06 0.000000e+00
## [2,] 0.0000e+00 4.608002e-07

But the result is far too small. A quick internet search suggested that this is a common problem for spatial data because of their heterogeneity. We resorted to choosing h manually, by setting a square bandwidth and decreasing it progressively. We settled on a (0.1, 0.1) bandwidth.
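To make the manual bandwidth search concrete, here is a small Python sketch of a Gaussian kernel density estimate with a diagonal bandwidth. It is an illustration only, not the ks code used here. It assumes that `hx` and `hy` are per-axis kernel standard deviations (whether the (0.1, 0.1) above is meant that way is an assumption), and `kde_diag` and `toy_points` are hypothetical names.

```python
import math

def kde_diag(x, y, points, hx, hy):
    """Gaussian product-kernel density estimate at (x, y).
    hx, hy: per-axis kernel standard deviations (diagonal bandwidth)."""
    norm = 1.0 / (2.0 * math.pi * hx * hy * len(points))
    total = 0.0
    for px, py in points:
        ux = (x - px) / hx
        uy = (y - py) / hy
        total += math.exp(-0.5 * (ux * ux + uy * uy))
    return norm * total

# Hypothetical toy data: a small cluster and an isolated point.
toy_points = [(0.0, 0.0), (0.05, 0.02), (0.1, -0.03), (1.0, 1.0)]

# Decrease a square bandwidth progressively and compare the estimate
# inside the cluster with the estimate in the empty area between groups.
for h in (0.4, 0.2, 0.1):
    near = kde_diag(0.05, 0.0, toy_points, h, h)
    far = kde_diag(0.5, 0.5, toy_points, h, h)
    print(f"h={h}: in cluster={near:.3f}, between groups={far:.3f}")
```

As h decreases the estimate concentrates around the observed points, which is why a bandwidth that is far too small gives a density made of isolated spikes.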

Final density estimation

Finding the modes with Mean Shift

Then we use the mean shift method to find the modes. We tried several parametrisations before settling on this one.
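The mean shift iteration itself can be sketched in a few lines of Python. This is a generic version with a Gaussian kernel and a single scalar bandwidth `h`, not the parametrisation retained for this report. The merge threshold `merge_dist` and the names `mean_shift_point`, `find_modes` and `toy_points` are assumptions made for the illustration.

```python
import math

def mean_shift_point(start, points, h, tol=1e-6, max_iter=500):
    """Move `start` uphill on the Gaussian KDE of `points` until it stops moving."""
    x, y = start
    for _ in range(max_iter):
        wsum = sx = sy = 0.0
        for px, py in points:
            w = math.exp(-0.5 * (((x - px) / h) ** 2 + ((y - py) / h) ** 2))
            wsum += w
            sx += w * px
            sy += w * py
        nx, ny = sx / wsum, sy / wsum  # kernel-weighted mean of the data
        if math.hypot(nx - x, ny - y) < tol:
            return nx, ny
        x, y = nx, ny
    return x, y

def find_modes(points, h, merge_dist=None):
    """Run mean shift from every data point and merge end points
    that land closer than merge_dist to an already found mode."""
    merge_dist = h / 2 if merge_dist is None else merge_dist
    modes = []
    for p in points:
        m = mean_shift_point(p, points, h)
        if all(math.hypot(m[0] - q[0], m[1] - q[1]) > merge_dist for q in modes):
            modes.append(m)
    return modes

# Hypothetical toy data with two well separated groups.
toy_points = [(0.0, 0.0), (0.05, 0.02), (0.1, -0.03), (1.0, 1.0), (1.02, 0.97)]
modes = find_modes(toy_points, h=0.1)
```

Each point climbs to the mode of its basin of attraction, so the number of distinct end points depends directly on the bandwidth.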

Should we have tried to correct the possible sampling bias?

Let us go back to the initial observation about the uncertain sampling process of the points. We tried to see whether it could have affected the results or whether it was negligible. So we re-estimated the density using the distance as a weight, to give more importance to the points where we thought the sampling was sparser. Of course this distance does not take into account the curvature of the road, but we hoped that the points were close enough for it not to be a major issue.

Estimating the density with weights

Theoretically we should estimate the bandwidth once again with the weights. But the previous issue, a data structure that induces a very small bandwidth, is not solved by introducing weights. Hence we kept h=(0.1, 0.1) as the bandwidth.
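A possible Python sketch of the weighted estimate, under stated assumptions: the weight of a point is the straight-line distance to the previous point of the same track (the first point reuses the distance to the second), weights are normalised to sum to 1, and the bandwidth components are kernel standard deviations. This is one plausible reading of "distance as a weight", not necessarily the exact weighting used here. `distance_weights`, `weighted_kde` and `toy_track` are hypothetical names.

```python
import math

def distance_weights(track):
    """track: list of at least two (x, y) points, not all identical.
    Weight = distance to the previous point, normalised to sum to 1."""
    d = [math.hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(track, track[1:])]
    raw = [d[0]] + d  # the first point has no predecessor
    total = sum(raw)
    return [r / total for r in raw]

def weighted_kde(x, y, points, weights, hx, hy):
    """Gaussian KDE where each point contributes proportionally to its weight."""
    norm = 1.0 / (2.0 * math.pi * hx * hy)
    return norm * sum(
        w * math.exp(-0.5 * (((x - px) / hx) ** 2 + ((y - py) / hy) ** 2))
        for (px, py), w in zip(points, weights)
    )

# Hypothetical track: densely sampled at the start, sparsely at the end.
toy_track = [(0.0, 0.0), (0.01, 0.0), (0.02, 0.0), (0.03, 0.0), (0.5, 0.0), (1.0, 0.0)]
w = distance_weights(toy_track)
uniform = [1.0 / len(toy_track)] * len(toy_track)

# Density at the sparsely sampled end, with and without the correction.
with_weights = weighted_kde(1.0, 0.0, toy_track, w, 0.1, 0.1)
without_weights = weighted_kde(1.0, 0.0, toy_track, uniform, 0.1, 0.1)
```

With uniform weights the densely sampled start dominates the estimate; the distance weights shift mass back toward the sparsely sampled stretch.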


There is no significant difference as far as the graphic can show. So we considered that taking those weights into account was not necessary for what follows.

II) Paths approach

Previously we considered the points as if they were not each part of a running session. In this part we try to take the running sessions into account.

The problem of the distance (and of infinite dimension)

To consider tracks we have to associate them with a mathematical object. For track points the association is obvious. It is a bit trickier with tracks. What is proposed is to consider them as functions. We will consider them to be functions from [0,1[ to C such that the limit of f at 1 is f(0). We chose this because we noticed that the tracks all started and ended at the same point. With this choice we have to deal with a couple of problems.

First, the space of functions is infinite-dimensional, hence norms are not equivalent, hence convergence for one measure of error does not guarantee convergence for another (at least to the extent of our limited knowledge of infinite-dimensional spaces). Basically we have no idea of what is going on theoretically.

This leads to rather speculative considerations regarding the distance to use. We will use the H distance, for lack of another idea that we could realistically compute. But we are still bothered by aspects such as:

[image]

This is particularly worrisome if we consider the constraints that roads put on the paths. Most functions from [0,1[ to C that we consider are actually not possible at all. Intuitively we would say that B and C are more similar than C and A because they share a common road for most of their tracks. (This is of course true if we are interested in location; A and C are more similar in terms of shape.) That is why we considered another distance:

[image]

But two problems made us go back to the H distance. First, our estimate of said distance is very clearly not a distance in the mathematical sense, and we do not know whether this is a problem. Secondly, its computation time is too long to use in the (already too long) mean shift. Indeed, the distance between A and B requires computing the distance between each point of A and each segment defined by consecutive points of B, in order to find the distance between each point of A and the polygon formed by the points of B, before taking the mean of those distances as the evaluation of the distance between A and B.
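The two candidate distances can be compared on a toy example with the Python sketch below. It assumes that "H distance" means the Hausdorff distance between the sets of track points, and it implements the alternative as the mean, over the points of A, of the distance to the polyline through B, as described above. All function names and the tracks `A` and `B` are hypothetical.

```python
import math

def point_segment_dist(p, a, b):
    """Distance from point p to the segment [a, b]."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg2 = dx * dx + dy * dy
    if seg2 == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def mean_dist_to_polyline(A, B):
    """Mean over the points of A of the distance to the polyline through B."""
    segments = list(zip(B, B[1:]))
    return sum(min(point_segment_dist(p, a, b) for a, b in segments) for p in A) / len(A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    def directed(P, Q):
        return max(min(math.hypot(px - qx, py - qy) for qx, qy in Q) for px, py in P)
    return max(directed(A, B), directed(B, A))

# Hypothetical tracks: A lies entirely on the first leg of B.
A = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
B = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]

d_ab = mean_dist_to_polyline(A, B)  # every point of A lies on B
d_ba = mean_dist_to_polyline(B, A)  # the last point of B is far from A
d_h = hausdorff(A, B)
```

On this example `d_ab` and `d_ba` differ, so the mean point-to-polyline quantity is not symmetric, which is one concrete way in which it fails to be a distance. Each evaluation also loops over every point of A and every segment of B.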

Computing the new density

We use the bandwidth of the previous part as epsilon.

Let’s take a look at the top 5 and the bottom 5 paths
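One way to read this step is sketched below in Python: the density of a track is taken to be the fraction of the other tracks lying within epsilon of it for the H distance. Both the uniform (ball) kernel and the reading of H as the Hausdorff distance between point sets are assumptions, not a description of the estimator actually used. `track_density`, `tracks` and `ranking` are hypothetical names.

```python
import math

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    def directed(P, Q):
        return max(min(math.hypot(px - qx, py - qy) for qx, qy in Q) for px, py in P)
    return max(directed(A, B), directed(B, A))

def track_density(tracks, eps):
    """For each track, fraction of the other tracks within distance eps."""
    n = len(tracks)
    dens = []
    for i, ti in enumerate(tracks):
        close = sum(1 for j, tj in enumerate(tracks) if j != i and hausdorff(ti, tj) <= eps)
        dens.append(close / (n - 1))
    return dens

# Hypothetical tracks: three nearly identical ones and one outlier.
tracks = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.04), (1.0, 0.04), (2.0, 0.04)],
    [(0.0, -0.04), (1.0, -0.04), (2.0, -0.04)],
    [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)],
]
dens = track_density(tracks, eps=0.1)

# Track indices from the most to the least typical track.
ranking = sorted(range(len(tracks)), key=lambda i: dens[i], reverse=True)
```

Sorting the tracks by this value gives a ranking from which the most and least typical paths can be read off.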

Now let’s (try to) find clusters

The return of problems

The problem of the choice of the mathematical representation strikes back here. If we were to consider the tracks as true functions and proceed with mean shift using functional addition (by approximating the tracks by n-gons, with n the number of points of the track), time would get far too great an importance in our opinion. From the beginning of the analysis, what fundamentally matters is where the runner goes, not in which order. Hence it is the image of the functions in C that we are truly interested in. If we wanted to take time into account we would need to change the distance, because the H distance ignores it. For example, if two tracks follow the same path in opposite directions, they are associated with two completely different functions, which may differ at any given t, yet the H distance is zero. If we were to run mean shift in those circumstances, here is what could happen.

[image]

But without it, what addition should we consider? We had to somehow go back to the point level, but with a density inherited from the track level. The main problem with this is that tracks are only translated and not deformed.
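The opposite-direction example can be checked numerically with the short Python sketch below. It assumes that the H distance is the Hausdorff distance between the point sets, and it compares it with a pointwise (same index, that is same t) distance and with the pointwise average that functional addition would produce. The square loop and all names are hypothetical.

```python
import math

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    def directed(P, Q):
        return max(min(math.hypot(px - qx, py - qy) for qx, qy in Q) for px, py in P)
    return max(directed(A, B), directed(B, A))

def pointwise_dist(A, B):
    """Mean distance between points sharing the same index, i.e. the tracks
    compared as functions of discretised time. Requires len(A) == len(B)."""
    return sum(math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(A, B)) / len(A)

# A square loop starting at (0, 0), and the same loop run in the opposite direction.
forward = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
backward = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]

d_h = hausdorff(forward, backward)       # same set of points
d_t = pointwise_dist(forward, backward)  # different functions of t

# Pointwise average of the two tracks, as functional addition would compute it.
average = [((ax + bx) / 2, (ay + by) / 2) for (ax, ay), (bx, by) in zip(forward, backward)]
```

Here the Hausdorff distance is zero while the pointwise distance is not, and the pointwise average collapses onto the diagonal of the square, a path that neither track follows.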

The results we got